Goto

Collaborating Authors

 definition 3


Latent Process Generator Matching

arXiv.org Machine Learning

A related situation arises when an auxiliary process is introduced to aid training but modelling its dynamics at generation time is unnecessary or difficult, as in Billera et al. [2025b] and Kim et al. [2025]. In each of these works, the projection result and its associated loss are derived on a case-by-case basis, and all theorems are restricted to marginalization over a discrete component of the extended state space. We introduce a general framework that removes these restrictions: given a time-inhomogeneous Feller process (Yt)0 t 1 on an arbitrary state space Y and a map Φ: Y X, one may learn a linear parametrisation of the generator of a Feller process on X whose one-time marginals coincide with those of (Φ(Yt))0 t 1. For Y = X Z and Φthe projection onto the first coordinate, this subsumes these prior works as special cases, allowing for a general class of latent processes (Zt)0 t 1 in a nearly arbitrary state space Z, using the formalism of generator matching to allow for continuous, discrete, or manifold-valued processes. In particular, the learnt process at t = 1 samples from the distribution of Φ(Y1), which is the desired data distribution. We give sufficient conditions for a loss function to be valid in this general setting, recovering the results of the works cited above as corollaries. This result has broad applicability, enabling the construction of a wide array of new flow matching schemes by allowing for a more general class of latent spaces. As a concrete new application, we outline a non-projection Φ: Y X with manifold-valued latents for protein structure generation that separates chain-level rigid-body motion from internal flexibility ( 4), where the particular chain-level versus residue-level or internal state is latent, and the model only sees the world state, which we plan to implement in future work. 2 EARLIERWORK Several recent generative models train with the aid of a latent stochastic process that is marginalised out at generation time.


Density-Ratio Losses for Post-Hoc Learning to Defer

arXiv.org Machine Learning

We study post-hoc Learning to Defer (L2D) through the lens of ideal distributions: divergence-regularized reweightings of the data distribution under which a model attains low loss. We define deferral via the density-ratio between a model's and an expert's ideals. Using the reduction from density-ratio estimation to class-probability estimation, we derive the DR CPE losses for post-hoc L2D scorers. Deferral decisions are then made by thresholding the scorer, allowing deferral rates to be adjusted without retraining. For KL-based ideal distributions, our deferral rules recovers Chow's rule under the original distribution and a connection to an expert-tilted Bayes posterior -- which incorporates the expert's performance -- depending on if the ideal distributions are joint or marginal distributions. Experimentally, our approach is competitive compared to common baselines and more robust across dataset settings. More broadly, our results cast post-hoc L2D as density-ratio learning between ideal distributions, bridging Chow-style rules, expert comparison, and elucidating connections to related learning settings including anomaly detection.


One Operator for Many Densities: Amortized Approximation of Conditioning by Neural Operators

arXiv.org Machine Learning

Probabilistic conditioning is concerned with the identification of a distribution of a random variable $X$ given a random variable $Y$. It is a cornerstone of scientific and engineering applications where modeling uncertainty is key. This problem has traditionally been addressed in machine learning by directly learning the conditional distribution of a fixed joint distribution. This paper introduces a novel perspective: we propose to solve the conditioning problem by identifying a single operator that maps any joint density to its conditional, thus amortizing over joint-conditional pairs. We establish that the conditioning operator can be approximated to arbitrary accuracy by neural operators. Our proof relies on new results establishing continuity of the conditioning operator over suitable classes of densities. Finally, we learn the conditioning map for a class of Gaussian mixtures using neural operators, illustrating the promise of our framework. This work provides the theoretical underpinnings for general-purpose, amortized methods for probabilistic conditioning, such as foundation models for Bayesian inference.


Muon Does Not Converge on Convex Lipschitz Functions

arXiv.org Machine Learning

Muon and its variants have shown strong empirical performance in a variety of deep learning tasks. Existing convergence analyses of Muon rely on smoothness assumptions, though arguably the most successful function class for developing deep learning methods (such as AdaGrad, Shampoo, Schedule-Free and more) has been the class of convex and Lipschitz functions. In this paper we question whether the classical convex Lipschitz model is a useful one for understanding Muon. Our answer is no. We show that Muon does not converge on the class of convex and Lipschitz functions, regardless of the choice of learning rate schedule. We also show that error feedback restores convergence of Muon and all the non-Euclidean subgradient methods with momentum. However, this theoretical fix using error feedback degrades the performance of Muon in two representative settings for image classification (CIFAR-10) and language modeling (nanoGPT on FineWeb-Edu 10B). Our conclusion is that convex Lipschitz theory, despite having a prominent role in the design of practical methods for deep learning, is not the most suited one for Muon. This suggests that Muon's success must come from structure absent from this model, most plausibly related to smoothness.


A Theory of Saddle Escape in Deep Nonlinear Networks

arXiv.org Machine Learning

In deep networks with small initialization, training exhibits long plateaus separated by sharp feature-acquisition transitions. Whereas shallow nonlinear networks and deep linear networks are well studied, extending these analyses to deep nonlinear networks remains challenging. We derive an exact identity for the imbalance of Frobenius norms of layer weight matrices that holds for any smooth activation and any differentiable loss and use this to classify activation functions into four universality classes. On the permutation-symmetric submanifold, the identity combines with an approximate balance law to reduce the full matrix flow to a scalar ODE, giving a critical-depth escape time law $τ_\star = Θ(\varepsilon^{-(r-2)})$ governed by the number $r$ of layers at the bottleneck scale rather than the total depth $L$. We find that this same $r-2$ exponent is recovered under He-normal initialization with $r$ bottleneck layers rescaled by $\varepsilon$, where the symmetry manifold is preserved by the flow but not attracting. We find close agreement between our theory and numerical simulations.


ec4f0b0a7557d6a51c42308800f2c23a-Supplemental-Conference.pdf

Neural Information Processing Systems

Let (x,y)be a binary classification task that admits a smooth separator as in Assumption 1. Then, there exists an RLC with neural network fθ and absolutely continuous randomness source u (Assumption 2) that is universal in the limit, i.e., Fθ (x) = y(x), x X, and makes random predictions that are correct with probability P(maj({sgn( a Further, if p is the number of parameters used by a deterministic neural network with one hidden layer to achieve zero-error in the task, fθ has at most p p +O(1)parameters. Since Assumption 1 holds3, there exists a single hidden-layer neural network N that, like s, achieves zero-error in this task [8]. Further, since sgn is nonpolynomial, we can use it as the nonlinearity of this network [21]. Putting it all together, there exists a number of hidden units M and parameters bj,oj R,wj Rd for j = 1,...,M such that N(x):= Note that this means we can achieve zero-error in classification, N(x) = y(x), x X.


How to Scale Your EMA

Neural Information Processing Systems

Preserving training dynamics across batch sizes is an important tool for practical machine learning as it enables the trade-off between batch size and wall-clock time. This trade-off is typically enabled by a scaling rule, for example, in stochastic gradient descent, one should scale the learning rate linearly with the batch size. Another important machine learning tool is the model EMA, a functional copy of a target model, whose parameters move towards those of its target model according to an Exponential Moving Average (EMA) at a rate parameterized by a momentum hyperparameter. This model EMA can improve the robustness and generalization of supervised learning, stabilize pseudo-labeling, and provide a learning signal for Self-Supervised Learning (SSL). Prior works have not considered the optimization of the model EMA when performing scaling, leading to different training dynamics across batch sizes and lower model performance. In this work, we provide a scaling rule for optimization in the presence of a model EMA and demonstrate the rule's validity across a range of architectures, optimizers, and data modalities. We also show the rule's validity where the model EMA contributes to the optimization of the target model, enabling us to train EMA-based pseudo-labeling and SSL methods at small and large batch sizes. For SSL, we enable training of BYOL up to batch size 24,576 without sacrificing performance, a 6 wall-clock time reduction under idealized hardware settings.




related work

Neural Information Processing Systems

Deterministic RLDeterministic system is often the starting case in the study of sample-efficient algorithms, where the issue of exploration and exploitation trade-off is more clearly revealed since both the transition kernel and reward function are deterministic. The seminal work [81] proposes a sample-efficient algorithm for Q-learning that works for a family of function classes. Recently, [21] studies the agnostic setting where the optimal Q-function can only be approximated by a function class with approximation error. The algorithm in [21] learns the optimal policy with the number of trajectories linear with the eluder dimension. Consider MDPM where the transition is deterministic. Assume the function class in Definition 3.1 satisfies Assumption 2.1 and Assumption 2.2. For any t (0,1), if d Ω(log(BW/λ))and n d poly(κ,k,λ,BW,Bϕ,H,log(d/t)), then with probability at least 1 tAlgorithm 1 returns the optimal policy π .